Weighting and normalisation of synchronous HMMs for audio-visual speech recognition
Abstract
In this paper, we examine the effect of varying the stream weights in synchronous multi-stream hidden Markov models (HMMs) for audio-visual speech recognition. Rather than considering the stream weights to be the same for training and testing, we examine the effect of different stream weights for each task on the final speech-recognition performance. Evaluating our system under varying levels of audio and video degradation on the XM2VTS database, we show that the final performance is primarily a function of the choice of stream weight used in testing, and that the choice of stream weight used for training has only a very minor effect. By varying the value of the testing stream weights, we show that the best average speech recognition performance occurs with the streams weighted at around 80% audio and 20% video. However, by examining the distribution of frame-by-frame scores for each stream on a left-out section of the database, we show that the chosen testing weights primarily serve to normalise the two stream score distributions, rather than indicating the dependence of the final performance on either stream. By using a novel adaptation of zero-normalisation to normalise each stream's models before performing the weighted fusion, we show that the actual contribution of the audio and video scores to the best-performing speech system is closer to equal than appears to be indicated by the un-normalised stream weighting parameters alone.
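The interplay between stream weights and score normalisation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the held-out score values, and the use of a simple per-stream mean and standard deviation are all assumptions made for illustration. It relies only on the identity that a weighted sum of raw scores equals, up to a constant, a weighted sum of zero-normalised scores with weights proportional to weight times standard deviation.

```python
from statistics import mean, pstdev


def znorm_params(heldout_scores):
    """Return (mean, std) of one stream's frame scores on held-out data."""
    mu = mean(heldout_scores)
    sigma = pstdev(heldout_scores)
    if sigma == 0:
        raise ValueError("held-out scores have zero variance")
    return mu, sigma


def znorm(score, mu, sigma):
    """Zero-normalise a single frame score: zero mean, unit variance."""
    return (score - mu) / sigma


def fuse(audio_score, video_score, audio_weight):
    """Weighted sum of two per-frame scores; the two weights sum to 1."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


def effective_audio_share(audio_weight, sigma_audio, sigma_video):
    """Share of the audio stream once each stream's score spread is
    factored out: w_a * sigma_a / (w_a * sigma_a + w_v * sigma_v)."""
    a = audio_weight * sigma_audio
    v = (1.0 - audio_weight) * sigma_video
    return a / (a + v)


# Hypothetical held-out frame scores (illustrative values only, not from
# the paper): the video scores are spread four times as widely as the audio.
heldout_audio = [-60.0, -55.0, -50.0, -45.0, -40.0]
heldout_video = [-140.0, -120.0, -100.0, -80.0, -60.0]

mu_a, sigma_a = znorm_params(heldout_audio)
mu_v, sigma_v = znorm_params(heldout_video)

# Fuse one frame after normalising each stream, here with equal weights.
fused = fuse(znorm(-45.0, mu_a, sigma_a), znorm(-120.0, mu_v, sigma_v), 0.5)

# How much an un-normalised 80/20 weighting really favours audio
# for these made-up score distributions.
share = effective_audio_share(0.8, sigma_a, sigma_v)
print(f"fused normalised score: {fused:.3f}")
print(f"effective audio share at raw weight 0.8: {share:.2f}")
```

With these made-up distributions, a raw 80/20 weighting corresponds to an equal effective contribution from the two streams, which is the kind of effect the abstract attributes to un-normalised weights.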
Similar resources
Improved bimodal speech recognition using tied-mixture HMMs and 5000 word audio-visual synchronous database
This paper presents methods to improve speech recognition accuracy by incorporating automatic lip reading. The paper improves lip reading accuracy through the following approaches: 1) collection of synchronous image and speech data for 5240 words, 2) feature extraction of 2-dimensional power spectra around the mouth, and 3) sub-word unit HMMs with tied-mixture distributions (Tied-Mixture HMMs). Experiments ...
Product HMMs for audio-visual continuous speech recognition using facial animation parameters
The use of visual information in addition to acoustic information can improve automatic speech recognition. In this paper we compare different approaches for audio-visual information integration and show how they affect automatic speech recognition performance. We use Facial Animation Parameters (FAPs), which the MPEG-4 standard supports for visual representation, as visual features. We use both Singl...
Fused HMM-adaptation of multi-stream HMMs for audio-visual speech recognition
A technique known as fused hidden Markov models (FHMMs) was recently proposed as an alternative multi-stream modelling technique for audio-visual speaker recognition. In this paper we show that for audio-visual speech recognition (AVSR), FHMMs can be adopted as a novel method of training synchronous MSHMMs. MSHMMs, as proposed by several authors for use in AVSR, are jointly trained on both the ...
Asynchrony modeling for audio-visual speech recognition
We investigate the use of multi-stream HMMs in the automatic recognition of audio-visual speech. Multi-stream HMMs allow the modeling of asynchrony between the audio and visual state sequences at a variety of levels (phone, syllable, word, etc.) and are equivalent to product, or composite, HMMs. In this paper, we consider such models synchronized at the phone boundary level, allowing various de...
Fused HMM adaptation of synchronous HMMs for audio-visual speaker verification
A technique known as fused hidden Markov models (FHMMs) was recently proposed as an alternative multi-stream modelling technique for audio-visual speaker recognition. In this paper, we show that instead of being treated as a separate modelling technique, FHMMs can be adopted as a novel method of training synchronous hidden Markov models (SHMMs). SHMMs are traditionally jointly trained on bot...